CLAIMS 

1. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data of the first document,
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining an edit sequence between at least part of the first set of data and at
least part of the second set of data, the edit sequence including any of insertions, deletions,
and substitutions; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the edit sequence.
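
As a minimal sketch of how the edit sequence of claim 1 might be computed, the following assumes each set of data has already been flattened into a list of tokens and that every insertion, deletion, and substitution has unit cost. The names edit_sequence and corresponding_indices are hypothetical and do not appear in the claims.

```python
def edit_sequence(a, b):
    """Return a minimum-cost list of (op, i, j) steps turning list a into list b."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest way to turn a[:i] into b[:j] (unit costs assumed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j] + 1,        # delete a[i-1]
                             cost[i][j - 1] + 1,        # insert b[j-1]
                             cost[i - 1][j - 1] + sub)  # match or substitute
    # Trace back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                ops.append(("match" if sub == 0 else "substitute", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return ops


def corresponding_indices(ops, selected):
    """Map selected positions of the first sequence to positions in the second."""
    return {i: j for op, i, j in ops
            if op in ("match", "substitute") and i in selected}
```

With a first token list ["<html>", "<b>", "Price", "</b>", "</html>"] and a second list that inserts a paragraph element before the bold element, the selected "Price" token at index 2 is carried to index 5 of the second list.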



2. The method of claim 1, wherein the edit sequence includes none of insertions,
deletions, and substitutions.



3. The method of claim 1, wherein the edit sequence includes at least one of one or
more insertions, one or more deletions, and one or more substitutions.

4. The method of claim 1, wherein the edit sequence is at least partly determined by
calculating a total cost, and each of one or more of insertions, deletions, substitutions, and 
matches is associated with one or more costs. 
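
The total-cost calculation of claim 4 can be illustrated with a small dynamic program in which each operation type carries its own cost. This is only a sketch: the Costs class, the total_cost function, and the default numeric values are assumptions chosen for illustration, not values taken from the claims.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-operation costs; the numbers are illustrative only."""
    insert: float = 1.0
    delete: float = 1.0
    substitute: float = 1.5
    match: float = 0.0


def total_cost(a, b, costs=None):
    """Return the minimum total cost of an edit sequence turning a into b."""
    costs = costs or Costs()
    m = len(b)
    # prev[j] holds the cheapest cost of turning the processed prefix of a into b[:j]
    prev = [j * costs.insert for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * costs.delete] + [0.0] * m
        for j in range(1, m + 1):
            diag = costs.match if a[i - 1] == b[j - 1] else costs.substitute
            cur[j] = min(prev[j] + costs.delete,
                         cur[j - 1] + costs.insert,
                         prev[j - 1] + diag)
        prev = cur
    return prev[m]
```

Raising the substitution cost above the combined cost of one deletion and one insertion makes the cheapest sequence avoid substitutions altogether, which is one way such costs could steer the result.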



5. The method of claim 4, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup 
language from the selected data of the first document and at least some markup language 
from the second document, the markup language including text-based content and tags. 

6. The method of claim 4, wherein a first cost is associated with a first match at a first
distance from a root of a tree representation of some set of data, a second cost is associated
with a second match at a second distance from a root of a tree representation of some set of
data, the first distance is less than the second distance, and the first cost and the second
cost are set to encourage the first match more than the second match.
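
One way to picture depth-dependent costs is to make each cost a function of the distance from the root. The sketch below is an assumption, not the claimed formula: match_cost returns a negative cost (a reward) that is largest at the root, and edit_cost returns a cost for insertions, deletions, or substitutions that simply differs with depth. Both function names and both formulas are hypothetical.

```python
def match_cost(depth, reward=1.0):
    """Negative cost (reward) for a match; strongest at depth 0, weaker deeper down."""
    return -reward / (1 + depth)


def edit_cost(depth, base=1.0):
    """Cost of an insertion, deletion, or substitution; assumed higher near the root."""
    return base * (1 + 1 / (1 + depth))
```

Under these assumed formulas a match at depth 0 costs -1.0 while a match at depth 3 costs -0.25, so the shallower match is encouraged more.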



7. The method of claim 4, wherein a first cost is associated with a first insertion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



8. The method of claim 4, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



9. The method of claim 4, wherein a first cost is associated with a first substitution at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.



10. The method of claim 4, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.
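
A length-sensitive substitution cost of the kind recited in claim 10 could look like the sketch below. The function name text_substitution_cost, the base cost, and the penalty weight are assumptions for illustration; the claim does not fix a particular formula.

```python
def text_substitution_cost(old_text, new_text, base=1.0, penalty=2.0):
    """Cost of replacing old_text with new_text; grows with the relative length gap."""
    longest = max(len(old_text), len(new_text))
    if longest == 0:
        return base  # two empty strings: no length gap to penalize
    relative_gap = abs(len(old_text) - len(new_text)) / longest
    return base + penalty * relative_gap
```

Replacing "$19.99" with "$21.49" keeps the length unchanged and costs only the base amount, while replacing it with a much longer string is penalized more heavily.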



11. The method of claim 4, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of text-based content for one or more tags. 



Attorney Docket No. 25961-704 
C:\NrPortbl\PALIBl\KS6\1368814 l.DOC 






12. The method of claim 4, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage 
substitutions of one or more tags for text-based content.

13. The method of claim 4, wherein a first cost is associated with preserving a first tag
with unchanged attributes, a second cost is associated with preserving a second tag with
one or more changed attributes, and the first cost and the second cost are set to discourage
preserving the second tag more than preserving the first tag.
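
The attribute-sensitive cost of claim 13 can be sketched by counting how many attributes differ between the two versions of a tag. The function name tag_preservation_cost and the per_change weight are hypothetical, and attributes are assumed to be given as plain dictionaries of strings.

```python
def tag_preservation_cost(old_attrs, new_attrs, per_change=0.5):
    """Cost of keeping a tag; zero when attributes are unchanged, higher otherwise."""
    keys = set(old_attrs) | set(new_attrs)
    # An attribute counts as changed if it was added, removed, or given a new value.
    changed = sum(1 for k in keys if old_attrs.get(k) != new_attrs.get(k))
    return per_change * changed
```

A tag whose attributes are identical in both documents costs 0.0 to preserve, so it is preferred over a tag that gained, lost, or altered an attribute.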

14. The method of claim 1, wherein document data is at least partly from the first 
document. 






15. The method of claim 1, wherein document data is at least partly from the second 
document. 



16. The method of claim 1, wherein the second document is received if the second 
document is different from the first document. 



17. The method of claim 1, wherein the markup language includes at least HTML
(Hypertext Markup Language).



18. The method of claim 1, wherein the markup language includes at least one of 
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

19. The method of claim 1, wherein the markup language includes at least WML
(Wireless Markup Language).

20. The method of claim 1, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

21. The method of claim 1, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.



22. The method of claim 1, wherein the correspondence is at least partly found by one
or more of: determining the edit sequence, at least part of at least one of a first plurality of
paths from a root of a tree representation of the first set of data to selected data of the tree
representation of the first set of data, at least part of at least one of a second plurality of
paths from a root of a tree representation of the second set of data to corresponding data of
the tree representation of the second set of data, and one or more edit sequences between at
least one of the first plurality of paths and at least one of the second plurality of paths.



23. The method of claim 1, wherein one or more of the first set of data and the second
set of data is represented at least partly by a tree.

24. The method of claim 1, wherein one or more of the first set of data and the second
set of data is represented at least partly by a set of linearized tokens.

25. The method of claim 1, wherein at least the first document and the second
document represent different documents.



26. The method of claim 1, wherein the first document and the second document 
represent a same document. 



27. The method of claim 1, wherein the first document and the second document
represent different versions of a same document.

28. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data of the first document,
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a tree-based edit sequence between at least part of the first set of data
and at least part of the second set of data, the tree-based edit sequence including any of
insertions, deletions, and substitutions; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the tree-based edit sequence.
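
To make the idea of a tree-based edit sequence concrete, the sketch below computes a top-down tree edit distance in which roots are compared first and child lists are then aligned recursively. This is a simplified, constrained variant chosen for brevity, not necessarily the procedure intended by the claims. Trees are assumed to be (label, children) tuples, every operation has unit cost per node, and the names tree_size and tree_distance are hypothetical.

```python
def tree_size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


def tree_distance(t1, t2):
    """Top-down edit distance: relabel roots, then align the two child lists."""
    root_cost = 0 if t1[0] == t2[0] else 1  # substitution of the root label
    c1, c2 = t1[1], t2[1]
    n, m = len(c1), len(c2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + tree_size(c1[i - 1])  # delete a whole subtree
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + tree_size(c2[j - 1])  # insert a whole subtree
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + tree_size(c1[i - 1]),
                d[i][j - 1] + tree_size(c2[j - 1]),
                d[i - 1][j - 1] + tree_distance(c1[i - 1], c2[j - 1]),
            )
    return root_cost + d[n][m]
```

Working on trees rather than flat tokens lets an inserted or deleted element be charged as a single subtree operation, which keeps unrelated page changes from disturbing the alignment of the selected subtree.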

29. The method of claim 28, wherein the tree-based edit sequence includes none of
insertions, deletions, and substitutions.

30. The method of claim 28, wherein the tree-based edit sequence includes at least one
of one or more insertions, one or more deletions, and one or more substitutions.




31. The method of claim 28, wherein the tree-based edit sequence is at least partly
determined by calculating a total cost, and each of one or more of insertions, deletions,
substitutions, and matches is associated with one or more costs.



32. The method of claim 31, wherein the one or more costs are at least partly set to
encourage the tree-based edit sequence to include one or more matches between at least
some markup language from the selected data of the first document and at least some
markup language from the second document, the markup language including text-based
content and tags.

33. The method of claim 31, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.

34. The method of claim 31, wherein a first cost is associated with a first insertion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

35. The method of claim 31, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



36. The method of claim 31, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.



37. The method of claim 31, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.



38. The method of claim 31, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

39. The method of claim 31, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

40. The method of claim 31, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag 
with one or more changed attributes, and the first cost and the second cost are set to 
discourage preserving the second tag more than preserving the first tag. 

41. The method of claim 28, wherein document data is at least partly from the first
document.

42. The method of claim 28, wherein document data is at least partly from the second
document.



43. The method of claim 28, wherein the second document is received if the second
document is different from the first document.



44. The method of claim 28, wherein the markup language includes at least HTML
(Hypertext Markup Language).

45. The method of claim 28, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

46. The method of claim 28, wherein the markup language includes at least WML
(Wireless Markup Language).



47. The method of claim 28, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

48. The method of claim 28, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

49. The method of claim 28, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a second edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the second edit sequence
including any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the second edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the second edit sequence.

50. The method of claim 28, wherein the correspondence is at least partly found by one
or more of: determining the tree-based edit sequence, at least part of at least one of a first
plurality of paths from a root of a tree representation of the first set of data to selected data
of the tree representation of the first set of data, at least part of at least one of a second
plurality of paths from a root of a tree representation of the second set of data to
corresponding data of the tree representation of the second set of data, and one or more
tree-based edit sequences between at least one of the first plurality of paths and at least one
of the second plurality of paths.

51. The method of claim 28, wherein one or more of the first set of data and the second
set of data is represented at least partly by a tree.

52. The method of claim 28, wherein one or more of the first set of data and the second
set of data is represented at least partly by a set of linearized tokens.
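
A set of linearized tokens can be produced by walking the markup in document order and recording each tag and text run together with its depth. The sketch below uses Python's standard html.parser module; the Linearizer class, the linearize function, and the (kind, value, depth) token layout are assumptions for illustration, and the input is assumed to be well formed (every opening tag has a matching closing tag).

```python
from html.parser import HTMLParser


class Linearizer(HTMLParser):
    """Collects (kind, value, depth) tokens in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        self.tokens.append(("close", tag, self.depth))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.tokens.append(("text", text, self.depth))


def linearize(markup):
    """Flatten well-formed markup into a list of tokens."""
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens
```

Keeping the depth in each token preserves enough of the tree structure for depth-dependent costs to be applied even when the edit sequence is computed over the flat list.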

53. The method of claim 28, wherein at least the first document and the second
document represent different documents.

54. The method of claim 28, wherein the first document and the second document
represent a same document.

55. The method of claim 28, wherein the first document and the second document
represent different versions of a same document.

56. The method of claim 28, further comprising:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

57. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining document data of the second set of data, by finding corresponding data
of the second set of data, the corresponding data having a correspondence to the selected
data of the first set of data;

identifying the corresponding data of the second set of data as selected data of the
second set of data, the selected data at least partly specifying document data;

accessing at least a third set of data of a third document, the third document
including markup language; and

determining document data of the third set of data, by finding corresponding data
of the third set of data, the corresponding data having a correspondence to at least one of
the selected data of the first set of data and the selected data of the second set of data.

58. The method of claim 57, wherein subsequent sets of data of documents are
received, the documents including markup language, document data of the subsequent sets
of data are determined by finding corresponding data of the subsequent sets of data, the
corresponding data of the subsequent sets correspond to the selected data of earlier sets of
data, the corresponding data of the subsequent sets are identified as selected data of the
subsequent sets of data, the selected data of the subsequent sets of data at least partly
specifying document data, and at least one of selected data of the earlier sets and the
selected data of the subsequent data at least partly determine corresponding data of later
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and
the later sets of data are received later than the subsequent sets of data.
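
The chaining described in claims 57 and 58 can be pictured as a loop in which the data found in each document becomes the selection used for the next one. The sketch below is an assumption about one simple arrangement: it always uses the most recent selection, whereas the claims also allow earlier selections to be consulted. The names track_selection and find_corresponding are hypothetical, and find_corresponding stands in for whatever matching procedure is used.

```python
def track_selection(documents, selected, find_corresponding):
    """Carry a selection through a series of documents, one step at a time.

    find_corresponding(prev_doc, selected, next_doc) must return the data in
    next_doc that corresponds to `selected` in prev_doc.
    """
    history = [selected]
    for prev_doc, next_doc in zip(documents, documents[1:]):
        # The corresponding data just found becomes the new selected data.
        selected = find_corresponding(prev_doc, selected, next_doc)
        history.append(selected)
    return history
```

Re-anchoring on the newest document means gradual changes across versions are followed step by step instead of being compared against an increasingly stale original.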

59. The method of claim 57, wherein document data is at least partly from the first
document.

60. The method of claim 57, wherein document data is at least partly from the second
document.

61. The method of claim 57, wherein document data is at least partly from the third
document.

62. The method of claim 57, wherein the second document is received if the second
document is different from the first document.

63. The method of claim 57, wherein the markup language includes at least HTML
(Hypertext Markup Language).

64. The method of claim 57, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

65. The method of claim 57, wherein the markup language includes at least WML
(Wireless Markup Language).

66. The method of claim 57, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

67. The method of claim 57, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

68. The method of claim 57, wherein at least two of the first document, the second
document, and the third document represent different documents.

69. The method of claim 57, wherein at least two of the first document, the second
document, and the third document represent a same document.

70. The method of claim 57, wherein at least two of the first document, the second
document, and the third document represent different versions of a same document.






71. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

finding one or more sets of corresponding data of the second set of data, each of
one or more sets of corresponding data having a strength of correspondence to the selected
data of the first set of data;

if two or more sets of corresponding data are found, then 1) if one of the
corresponding sets of data has a substantially higher strength of correspondence than
strengths of correspondence of the other corresponding sets of data, assigning a high
measure of quality to the selection of the selected data, and 2) assigning a low measure of
quality to the selection of the selected data, if at least one of: 2a) none of the corresponding
sets of data has a substantially higher strength of correspondence than strengths of
correspondence of the other corresponding sets of data, and 2b) if strengths of
correspondence of all corresponding sets of data are low.
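
The quality rule of claim 71 can be sketched as a comparison of the best strength of correspondence against the runner-up. The thresholds margin and low, the numeric strength scale, and the function name selection_quality are assumptions for illustration. The sketch also assumes that condition 2b) takes precedence when every strength is low, which is one possible reading of the claim.

```python
def selection_quality(strengths, margin=0.2, low=0.3):
    """Return 'high' or 'low' for two or more candidate strengths, else None."""
    if len(strengths) < 2:
        return None  # the rule applies only when two or more candidates are found
    ranked = sorted(strengths, reverse=True)
    if ranked[0] < low:
        return "low"   # 2b) every candidate corresponds only weakly
    if ranked[0] - ranked[1] >= margin:
        return "high"  # 1) one candidate clearly stands out
    return "low"       # 2a) no candidate is substantially stronger than the rest
```

A low result signals that the original selection is ambiguous in the second document, which is the situation in which a larger selection might be tried.
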

72. The method of claim 71, wherein document data is at least partly from the first
document.

73. The method of claim 71, wherein document data is at least partly from the second
document.

74. The method of claim 71, wherein the second document is received if the second
document is different from the first document.

75. The method of claim 71, wherein the markup language includes at least HTML
(Hypertext Markup Language).

76. The method of claim 71, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

77. The method of claim 71, wherein the markup language includes at least WML
(Wireless Markup Language).

78. The method of claim 71, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

79. The method of claim 71, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

80. The method of claim 71, wherein the first document and the second document
represent different documents.

81. The method of claim 71, wherein the first document and the second document
represent a same document.

82. The method of claim 71, wherein the first document and the second document
represent different versions of a same document.

83. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes a first selected subset and a second
selected subset, such that the second selected subset of data is a subset of the first selected
subset of data, the first selected subset at least partly specifying document data, the second
selected subset at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a first edit sequence between at least part of the first set of data and at
least part of the second set of data, the first edit sequence including any of insertions,
deletions, and substitutions;

finding a first corresponding subset of the second set of data, the first
corresponding subset having a correspondence to the first selected subset, the
correspondence at least partly found by determining the first edit sequence;

determining a second edit sequence between at least part of the first set of data and
at least part of the second set of data, the first set of data including at least part of the first
selected subset, the second set of data including at least part of the first corresponding
subset, the second edit sequence including any of insertions, deletions, and substitutions;
and

finding a second corresponding subset of the second set of data, the second
corresponding subset having a correspondence to the second selected subset, the
correspondence at least partly found by determining the second edit sequence.

84. The method of claim 83, wherein at least one of the first edit sequence and the
second edit sequence includes none of insertions, deletions, and substitutions.

85. The method of claim 83, wherein at least one of the first edit sequence and the
second edit sequence includes at least one of one or more insertions, one or more
deletions, and one or more substitutions.

86. The method of claim 83, wherein at least one of the first edit sequence and the
second edit sequence is at least partly determined by calculating a total cost, and each of
one or more of insertions, deletions, substitutions, and matches is associated with one or
more costs.

87. The method of claim 86, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup
language from the selected data of the first document and at least some markup language
from the second document, the markup language including text-based content and tags.

88. The method of claim 86, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.

89. The method of claim 86, wherein a first cost is associated with a first insertion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

90. The method of claim 86, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

91. The method of claim 86, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.

92. The method of claim 86, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.

93. The method of claim 86, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

94. The method of claim 86, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

95. The method of claim 86, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag
with one or more changed attributes, and the first cost and the second cost are set to
discourage preserving the second tag more than preserving the first tag.

96. The method of claim 83, wherein document data is at least partly from the first
document.

97. The method of claim 83, wherein document data is at least partly from the second
document.

98. The method of claim 83, wherein the second document is received if the second
document is different from the first document.



99. The method of claim 83, wherein the markup language includes at least HTML
(Hypertext Markup Language).

100. The method of claim 83, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

101. The method of claim 83, wherein the markup language includes at least WML
(Wireless Markup Language).

102. The method of claim 83, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

103. The method of claim 83, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

104. The method of claim 83, further comprising: 

if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a third edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the third edit sequence including
any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the third edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the third edit sequence.

105. The method of claim 83, wherein one or more of the first set of data and the second
set of data is represented at least partly by a tree.

106. The method of claim 83, wherein one or more of the first set of data and the second
set of data is represented at least partly by a set of linearized tokens.
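
As an illustration only, the linearized-token representation and an edit sequence over it might be sketched as follows. The tokenizer regular expression and the names linearize and edit_sequence are assumptions made for this sketch and do not appear in the claims. Python's difflib.SequenceMatcher is used as a stand-in aligner; it reports equal, replace, insert and delete operations but does not guarantee a minimum-cost edit sequence.

```python
import re
from difflib import SequenceMatcher

# Assumed tokenizer: a tag is anything between "<" and ">", text is the rest.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def linearize(markup):
    """Flatten markup into a list of tag tokens and text tokens."""
    tokens = (t.strip() for t in TOKEN_RE.findall(markup))
    return [t for t in tokens if t]


def edit_sequence(first_tokens, second_tokens):
    """Return (operation, first-side tokens, second-side tokens) triples."""
    matcher = SequenceMatcher(a=first_tokens, b=second_tokens, autojunk=False)
    return [
        (op, first_tokens[i1:i2], second_tokens[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


first = linearize("<ul><li>Price</li><li>10</li></ul>")
second = linearize("<ul><li>New</li><li>Price</li><li>12</li></ul>")
for op, old, new in edit_sequence(first, second):
    print(op, old, new)
```

Tokens reported under "equal" are the matches; the position of the selected data in the first token list can then be mapped through them to a position in the second list.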

107. The method of claim 83, wherein the first document and the second document 
represent different documents. 



108. The method of claim 83, wherein the first document and the second document
represent a same document.

109. The method of claim 83, wherein the first document and the second document
represent different versions of a same document.

110. The method of claim 83, wherein at least one of the first edit sequence and the 
second edit sequence includes a tree-based edit sequence.

111. The method of claim 83, wherein at least one of determining the first edit sequence
and determining the second edit sequence comprises:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;
performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

112. A method of extraction, comprising:

accessing at least a plurality of first sets of data of a plurality of first documents,
the first documents including markup language, wherein each of the plurality of first sets
of data includes selected data, the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a most corresponding first set of data of the plurality of first sets of
data, the most corresponding first set of data having most correspondence with the second
set of data, by comparing partial representations of the plurality of first sets of data with a
partial representation of the second set of data.

113. The method of claim 112, wherein document data is at least partly from one or
more of the plurality of first documents.

114. The method of claim 1 0, wherein document data is at least partly from the second
document.

115. The method of claim 112, wherein the second document is received if the second
document is different from at least one of the plurality of first documents.

116. The method of claim 112, wherein the second document is received if the second
document is different from all of the plurality of first documents.

117. The method of claim 112, wherein the markup language includes at least HTML
(Hypertext Markup Language).

118. The method of claim 112, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

119. The method of claim 112, wherein the markup language includes at least WML
(Wireless Markup Language).

120. The method of claim 112, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

121. The method of claim 112, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

122. The method of claim 112, wherein the partial representation of the second set of
data includes a hash value computed on at least part of the second set of data.

123. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes a hash value computed on at least part of the first
set of data of the plurality of first sets of data.

124. The method of claim 112, wherein the partial representation of the second set of
data includes at least a partial syntax tree of the second set of data.

125. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes at least a partial syntax tree of the first set of data
of the plurality of first sets of data.

126. The method of claim 112, wherein the partial representation of the second set of
data includes a hash value computed on at least a partial syntax tree of the second set of
data.

127. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data includes a hash value computed on at least a partial syntax
tree of the first set of data of the plurality of first sets of data.
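
One possible reading of comparing partial representations by hash value is sketched below, for illustration only. It assumes well-formed markup, takes the partial syntax tree to be the tag skeleton down to a fixed depth, and uses SHA-256; the depth limit, the hash function, and the names skeleton_hash and most_corresponding are assumptions of this sketch and are not stated in the claims.

```python
import hashlib
from html.parser import HTMLParser


class _TagSkeleton(HTMLParser):
    """Collects start and end tags down to max_depth, ignoring text and attributes."""

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth < self.max_depth:
            self.parts.append(f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        if self.depth < self.max_depth:
            self.parts.append(f"</{tag}>")


def skeleton_hash(markup, max_depth=3):
    """Hash value computed on a partial syntax tree (assumed: shallow tag skeleton)."""
    parser = _TagSkeleton(max_depth)
    parser.feed(markup)
    parser.close()
    return hashlib.sha256("".join(parser.parts).encode("utf-8")).hexdigest()


def most_corresponding(first_sets, second):
    """Return the name of the first set whose skeleton hash equals the second's, if any."""
    target = skeleton_hash(second)
    for name, markup in first_sets.items():
        if skeleton_hash(markup) == target:
            return name
    return None


first_sets = {
    "product": "<html><body><div><p>Price: 10</p></div></body></html>",
    "news": "<html><body><ul><li>Story</li></ul></body></html>",
}
second = "<html><body><div><p>Price: 12</p></div></body></html>"
print(most_corresponding(first_sets, second))
```

Because only shallow structure is hashed, a second document whose text changed but whose layout did not still selects the same first set; the full edit sequence is then computed only against that one candidate.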

128. The method of claim 112, wherein the partial representation of the second set of
data includes at least one of a part of a name of the second set of data and a part of a name
of the second document.

129. The method of claim 112, wherein a partial representation of a first set of data of
the plurality of first sets of data of first documents includes at least one of 1) a part of a
name of the first set of data of the plurality of first sets of data and 2) a part of a name of a
first document of the first documents, the first document of the first documents including
the first set of data of the plurality of first sets of data.

130. The method of claim 112, wherein at least two documents out of the first plurality
of documents and the second document represent different documents.

131. The method of claim 112, wherein at least two documents out of the first plurality
of documents and the second document represent a same document.

132. The method of claim 112, wherein at least two documents out of the first plurality
of documents and the second document represent different versions of a same document.

133. A method of extraction, comprising:

accessing at least a first tree of data of a first document, the first document
including markup language, wherein the first tree of data includes selected data, the
selected data at least partly specifying document data;

accessing at least a second tree of data of a second document, the second document
including markup language;

determining at least one edit sequence of forward and backward edit sequences
between at least part of the first tree and at least part of the second tree;
performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree, the relevant
subtree at least partly determined from the forward and backward edit sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree;

2a) pruning a relevant subtree from at least part of the second tree, the relevant
subtree at least partly determined from the forward and backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

134. The method of claim 133, wherein document data is at least partly from the first
document.

135. The method of claim 133, wherein document data is at least partly from the second
document.

136. The method of claim 133, wherein the second document is received if the second
document is different from the first document.

137. The method of claim 133, wherein the markup language includes at least HTML
(Hypertext Markup Language).

138. The method of claim 133, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

139. The method of claim 133, wherein the markup language includes at least WML
(Wireless Markup Language).



140. The method of claim 133, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).



141. The method of claim 133, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

142. The method of claim 133, wherein determining forward and backward edit
sequences, pruning a relevant subtree, and determining a pruned edit sequence are
performed for each of a plurality of subtree pairs, each of the plurality of subtree pairs
including a subtree from the first tree and a subtree from the second tree.

143. The method of claim 133, wherein the first document and the second document
represent different documents.

144. The method of claim 133, wherein the first document and the second document
represent a same document.

145. The method of claim 133, wherein the first document and the second document
represent different versions of a same document.

146. A method of extracting relevant data, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data of the first document,
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining a first edit sequence between at least part of the first set of data and at
least part of the second set of data, the first edit sequence including any of insertions,
deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the first edit sequence;
if two or more corresponding data are found, then:
selecting larger selected data, at least part of the larger selected data
including a larger subtree in a tree representation of the first set of data, the larger subtree
including the selected data;

determining a second edit sequence between at least part of the first set of
data and at least part of the second set of data, the first set of data including at least part of
the larger selected data, the second edit sequence including any of insertions, deletions,
and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the second edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the second edit sequence.

147. The method of claim 146, wherein document data is at least partly from the first
document.

148. The method of claim 146, wherein document data is at least partly from the second
document.

149. The method of claim 146, wherein the second document is received if the second
document is different from the first document.

150. The method of claim 146, wherein the markup language includes at least HTML
(Hypertext Markup Language).

151. The method of claim 146, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).



152. The method of claim 146, wherein the markup language includes at least WML
(Wireless Markup Language).

153. The method of claim 146, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

154. The method of claim 146, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

155. The method of claim 146, wherein the first document and the second document
represent different documents.

156. The method of claim 146, wherein the first document and the second document
represent a same document.

157. The method of claim 146, wherein the first document and the second document
represent different versions of a same document.

158. A method of extraction, comprising: 

accessing at least a first tree of data of a first document, the first document
including markup language, wherein the first tree of data includes selected data, the
selected data at least partly specifying document data;
accessing at least a second tree of data of a second document, the second document
including markup language;

performing tree traversal on at least part of the second tree, the tree traversal at
least partly guided by the selected data and by at least part of the first tree; and

if tree traversal fails due to one or more differences between at least part of the
second tree and at least part of the selected data, then:

determining an edit sequence between at least part of the second tree and at
least part of the first tree, the first tree including at least part of the selected data;

finding corresponding data for at least part of the second tree, the
corresponding data having a correspondence to at least part of the selected data, the
correspondence at least partly found by determining the edit sequence; and

continuing to perform tree traversal on at least part of the second tree, the
tree traversal at least partly guided by the corresponding data.
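
A minimal sketch of traversal with an edit-sequence fallback follows, for illustration only. It assumes the selected data is identified by a path of child indices in the first tree, treats a tag mismatch or a missing child as a traversal failure, and aligns only the child tag lists of the two current nodes using difflib.SequenceMatcher. The names follow and corresponding_index, and the path representation, are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def corresponding_index(first_children, second_children, index):
    """Map a child index of the first node to the second node via matched tags."""
    a = [child.tag for child in first_children]
    b = [child.tag for child in second_children]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal" and i1 <= index < i2:
            return j1 + (index - i1)
    return None  # the child was deleted or substituted


def follow(first_root, second_root, path):
    """Walk the second tree along a path recorded on the first tree."""
    node_a, node_b = first_root, second_root
    for index in path:
        kids_a, kids_b = list(node_a), list(node_b)
        target = kids_a[index]
        if index < len(kids_b) and kids_b[index].tag == target.tag:
            j = index  # plain traversal succeeds
        else:
            # Traversal failed: fall back to an edit sequence over the children.
            j = corresponding_index(kids_a, kids_b, index)
            if j is None:
                return None
        node_a, node_b = target, kids_b[j]
    return node_b


first_root = ET.fromstring(
    "<body><h1>T</h1><table><tr><td>10</td></tr></table></body>")
second_root = ET.fromstring(
    "<body><div>Ad</div><h1>T</h1><table><tr><td>12</td></tr></table></body>")
found = follow(first_root, second_root, [1, 0, 0])
print(found.text if found is not None else None)
```

In this example the inserted div shifts the table by one position, so the first step fails and is repaired by the alignment; the remaining steps proceed by plain traversal.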

159. The method of claim 158, wherein for subsequent set tree traversal failures, 
determining, finding and continuing are repeated.

160. The method of claim 158, wherein document data is at least partly from the first
document.

161. The method of claim 158, wherein document data is at least partly from the second
document.

162. The method of claim 158, wherein the second document is received if the second
document is different from the first document.

163. The method of claim 158, wherein the markup language includes at least HTML
(Hypertext Markup Language).



164. The method of claim 158, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

165. The method of claim 158, wherein the markup language includes at least WML
(Wireless Markup Language).

166. The method of claim 158, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

167. The method of claim 158, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

168. The method of claim 158, wherein the first document and the second document 
represent different documents. 



169. The method of claim 158, wherein the first document and the second document
represent a same document.

170. The method of claim 158, wherein the first document and the second document
represent different versions of a same document. 

171. A method of extracting relevant data, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data of the first document,
the selected data at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining an edit sequence between the first set of data and the second set of
data, the edit sequence including any of insertions, deletions, and substitutions; and

if the edit sequence fails a test, determining a tree-based edit sequence between the
first set of data and the second set of data, the edit sequence including any of insertions,
deletions, and substitutions.

172. The method of claim 171, wherein document data is at least partly from the first
document.

173. The method of claim 171, wherein document data is at least partly from the second
document.

174. The method of claim 171, wherein the second document is received if the second
document is different from the first document.

175. The method of claim 171, wherein the markup language includes at least HTML
(Hypertext Markup Language).

176. The method of claim 171, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

177. The method of claim 171, wherein the markup language includes at least WML
(Wireless Markup Language).

178. The method of claim 171, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

179. The method of claim 171, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

180. The method of claim 171, wherein the first document and the second document
represent different documents.

181. The method of claim 171, wherein the first document and the second document
represent a same document.

182. The method of claim 171, wherein the first document and the second document
represent different versions of a same document.

183. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining document data of the second set of data, by finding corresponding data
of the second set of data, the corresponding data having a correspondence to the selected
data of the first set of data, the correspondence at least partly determined by a first edit
sequence between at least part of the first set of data and at least part of the second set of
data, the first edit sequence including any of insertions, deletions, and substitutions;

identifying the corresponding data of the second set of data as selected data of the
second set of data, the selected data at least partly specifying document data;

accessing at least a third set of data of a third document, the third document
including markup language; and

determining document data of the third set of data, by finding corresponding data
of the third set of data, the corresponding data having a correspondence to at least one of
the selected data of the first set of data and the selected data of the second set of data, the
correspondence at least partly determined by a second edit sequence between at least part
of the third set of data and at least one of at least part of the first set of data and at least
part of the second set of data, the second edit sequence including any of insertions,
deletions, and substitutions.

184. The method of claim 185, wherein at least one of the first edit sequence and the
second edit sequence includes none of insertions, deletions, and substitutions.

185. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence includes at least one of one or more insertions, one or more
deletions, and one or more substitutions.

186. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence is at least partly determined by calculating a total cost, and each of
one or more of insertions, deletions, substitutions, and matches is associated with one or
more costs.
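
For illustration, an edit sequence chosen by minimizing a total cost can be sketched with a standard dynamic program over token lists. The specific cost values (0 for a match, 1 for an insertion, deletion, or like-for-like substitution, 3 for substituting a tag for text or text for a tag) and the names is_tag, substitution_cost and cheapest_edit_sequence are assumptions of this sketch; the claims state only that costs are associated with the operations.

```python
INSERT_COST = 1
DELETE_COST = 1


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def substitution_cost(a, b):
    """Assumed costs: a match is free; tag/text substitutions are discouraged."""
    if a == b:
        return 0
    if is_tag(a) != is_tag(b):
        return 3
    return 1


def cheapest_edit_sequence(first, second):
    """Return (total cost, operations) for turning `first` into `second`."""
    n, m = len(first), len(second)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * DELETE_COST
    for j in range(1, m + 1):
        cost[0][j] = j * INSERT_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + DELETE_COST,
                cost[i][j - 1] + INSERT_COST,
                cost[i - 1][j - 1] + substitution_cost(first[i - 1], second[j - 1]),
            )
    # Walk back from the corner to recover one minimum-cost edit sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + substitution_cost(first[i - 1], second[j - 1])):
            kind = "match" if first[i - 1] == second[j - 1] else "substitute"
            ops.append((kind, first[i - 1], second[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + DELETE_COST:
            ops.append(("delete", first[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, second[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops


total, ops = cheapest_edit_sequence(["<td>", "10", "</td>"], ["<td>", "12", "</td>"])
print(total, [kind for kind, _, _ in ops])
```

With these assumed values a tag is never substituted for text, because deleting one and inserting the other costs 2 rather than 3.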

187. The method of claim 185, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup
language from the selected data of the first document and at least some markup language
from the second document, the markup language including text-based content and tags.

188. The method of claim 185, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.

189. The method of claim 185, wherein a first cost is associated with a first insertion at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



190. The method of claim 1816, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



191. The method of claim 185, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.



192. The method of claim 185, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.



193. The method of claim 185, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

194. The method of claim 185, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

195. The method of claim 185, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag
with one or more changed attributes, and the first cost and the second cost are set to
discourage preserving the second tag more than preserving the first tag.



196. The method of claim 183, wherein subsequent sets of data of documents are
received, the documents including markup language, document data of the subsequent sets
of data are determined by finding corresponding data of the subsequent sets of data, the
corresponding data of the subsequent sets correspond to the selected data of earlier sets of
data, the corresponding data of the subsequent sets are identified as selected data of the
subsequent sets of data, the selected data of the subsequent sets of data at least partly
specifying document data, and at least one of selected data of the earlier sets and the
selected data of the subsequent data at least partly determine corresponding data of later
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and
the later sets of data are received later than the subsequent sets of data.

197. The method of claim 183, wherein document data is at least partly from the first
document.

198. The method of claim 183, wherein document data is at least partly from the second
document.

199. The method of claim 183, wherein document data is at least partly from the third
document.

200. The method of claim 183, wherein the second document is received if the second
document is different from the first document.

201. The method of claim 183, wherein the markup language includes at least HTML
(Hypertext Markup Language).

202. The method of claim 183, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (eXtensible Markup Language).

203. The method of claim 183, wherein the markup language includes at least WML
(Wireless Markup Language).

204. The method of claim 183, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

205. The method of claim 183, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

206. The method of claim 183, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a third edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the third edit sequence including
any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the third edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the third edit sequence.

207. The method of claim 183, wherein one or more of the first set of data and the
second set of data is represented at least partly by a tree.

208. The method of claim 183, wherein one or more of the first set of data and the
second set of data is represented at least partly by a set of linearized tokens.

209. The method of claim 183, wherein at least two of the first document, the second
document, and the third document represent different documents.

210. The method of claim 183, wherein at least two of the first document, the second
document, and the third document represent a same document.

211. The method of claim 183, wherein at least two of the first document, the second
document, and the third document represent different versions of a same document.

212. The method of claim 183, wherein at least one of the first edit sequence and the
second edit sequence includes a tree-based edit sequence.

213. The method of claim 183, wherein determining the edit sequence comprises:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;
performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;



Attorney Docket No. 25961-704 
C:\NrPortbl\PALIB 1 \KS6\1 3688 1 4_1 .DOC 




2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

214. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

finding one or more sets of corresponding data of the second set of data, each of
one or more sets of corresponding data having a strength of correspondence to the selected
data of the first set of data, the strength of correspondence at least partly determined by an
edit sequence between at least part of the second set of data and at least part of the first set
of data, the edit sequence including any of insertions, deletions, and substitutions;
if two or more sets of corresponding data are found, then 1) if one of the
corresponding sets of data has a substantially higher strength of correspondence than
strengths of correspondence of the other corresponding sets of data, assigning a high
measure of quality to the selection of the selected data, and 2) if none of the corresponding
sets of data has a substantially higher strength of correspondence than strengths of
correspondence of the other corresponding sets of data, assigning a low measure of quality
to the selection of the selected data.
25 \ 
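
The quality test recited in claim 214 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `selection_quality`, the use of positive numeric strengths, and the `margin` factor standing in for "substantially higher" are all hypothetical choices.

```python
from typing import Optional, Sequence


def selection_quality(strengths: Sequence[float], margin: float = 1.5) -> Optional[str]:
    """Rate a selection from the strengths of its candidate correspondences.

    `margin` is an assumed threshold: the best candidate counts as
    "substantially higher" when it is at least `margin` times the runner-up.
    Strengths are assumed to be positive numbers.
    """
    if len(strengths) < 2:
        # The claim only addresses the case of two or more candidates.
        return None
    ranked = sorted(strengths, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return "high" if best >= margin * runner_up else "low"
```

A clear winner such as `[0.9, 0.3, 0.2]` would be rated "high" under these assumptions, while near-ties such as `[0.9, 0.85]` would be rated "low", signalling an ambiguous selection.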

215. The method of claim 214, wherein the edit sequence includes none of insertions,
deletions, and substitutions.

216. The method of claim 214, wherein the edit sequence includes at least one of one or
more insertions, one or more deletions, and one or more substitutions.

217. The method of claim 214, wherein the edit sequence is at least partly determined by
calculating a total cost, and each of one or more of insertions, deletions, substitutions, and
matches is associated with one or more costs.
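
A cost-driven edit sequence of the kind recited in claim 217 can be sketched with a standard dynamic-programming alignment over token lists. The cost table `COSTS`, the function name `edit_sequence`, and the `(operation, source index, destination index)` tuple format are assumptions made for illustration; the claims do not fix any particular values or representation.

```python
from typing import Dict, List, Tuple

# Hypothetical costs; the claim only requires that operations carry costs.
COSTS: Dict[str, float] = {"match": 0.0, "sub": 2.0, "ins": 1.0, "del": 1.0}

Op = Tuple[str, int, int]  # (operation, index in src or -1, index in dst or -1)


def edit_sequence(src: List[str], dst: List[str],
                  costs: Dict[str, float] = COSTS) -> Tuple[float, List[Op]]:
    """Return (total_cost, ops) for a minimum-cost edit of src into dst."""
    n, m = len(src), len(dst)
    # dp[i][j] holds the minimum cost of turning src[:i] into dst[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + costs["del"]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + costs["ins"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = costs["match"] if src[i - 1] == dst[j - 1] else costs["sub"]
            dp[i][j] = min(dp[i - 1][j - 1] + diag,
                           dp[i - 1][j] + costs["del"],
                           dp[i][j - 1] + costs["ins"])
    # Walk back from the end to recover one optimal sequence of operations.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = src[i - 1] == dst[j - 1]
            diag = costs["match"] if same else costs["sub"]
            if dp[i][j] == dp[i - 1][j - 1] + diag:
                ops.append(("match" if same else "sub", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + costs["del"]:
            ops.append(("del", i - 1, -1))
            i -= 1
        else:
            ops.append(("ins", -1, j - 1))
            j -= 1
    ops.reverse()
    return dp[n][m], ops
```

The `match` entries in the returned list are what tie tokens of the first set of data to their counterparts in the second set, which is how a correspondence can be read off the edit sequence.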



218. The method of claim 217, wherein the one or more costs are at least partly set to
encourage the edit sequence to include one or more matches between at least some markup
language from the selected data of the first document and at least some markup language
from the second document, the markup language including text-based content and tags.



219. The method of claim 217, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.

220. The method of claim 217, wherein a first cost is associated with a first insertion at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.






221. The method of claim 217, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.



222. The method of claim 217, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.
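
One way to make costs depend on distance from the root, as claims 219 through 222 describe, is to scale a base cost by depth. The sketch below assumes a geometric `decay` and hypothetical base values; the claims only require that the costs differ with depth and, for matches, that the match nearer the root be encouraged more.

```python
def depth_scaled(base: float, depth: int, decay: float = 0.5) -> float:
    """Shrink the magnitude of a base cost geometrically with depth.

    `decay` is an assumed tuning parameter, not a value from the claims.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return base * decay ** depth


def match_cost(depth: int) -> float:
    # A negative cost acts as a reward, and the reward is largest at the root.
    return depth_scaled(-1.0, depth)


def insertion_cost(depth: int) -> float:
    return depth_scaled(1.0, depth)


def deletion_cost(depth: int) -> float:
    return depth_scaled(1.0, depth)


def substitution_cost(depth: int) -> float:
    return depth_scaled(2.0, depth)
```

With these assumed values a match at depth 1 costs -0.5 and a match at depth 3 costs -0.125, so a minimum-cost search prefers the shallower match.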

223. The method of claim 217, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.
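
The length-sensitive substitution cost of claim 223 might look like the following sketch. The function name, the `base` and `penalty` parameters, and the relative-difference formula are assumptions chosen for illustration.

```python
def text_substitution_cost(old: str, new: str,
                           base: float = 1.0, penalty: float = 2.0) -> float:
    """Cost of replacing text `old` with text `new`.

    The cost grows with the relative difference in length, so replacing a
    short price with another short price is cheaper than replacing it with
    a long paragraph. `base` and `penalty` are assumed tuning values.
    """
    longest = max(len(old), len(new))
    if longest == 0:
        return base
    ratio = abs(len(old) - len(new)) / longest  # 0.0 when lengths are equal
    return base + penalty * ratio
```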

224. The method of claim 217, wherein markup language includes at least text-based 
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags. 

225. The method of claim 217, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

226. The method of claim 217, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag
with one or more changed attributes, and the first cost and the second cost are set to
discourage preserving the second tag more than preserving the first tag.
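
Claims 224 through 226 distinguish tags from text-based content and unchanged attributes from changed ones. A minimal sketch of a pairwise cost that respects those preferences is shown below; the `Token` type and every numeric value are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    kind: str                                 # "tag" or "text"
    value: str                                # tag name or text content
    attrs: Tuple[Tuple[str, str], ...] = ()   # sorted (name, value) pairs


def pair_cost(a: Token, b: Token) -> float:
    """Cost of aligning token `a` with token `b` (all values are assumed)."""
    if a.kind != b.kind:
        return 10.0   # strongly discourage swapping a tag for text or text for a tag
    if a.kind == "text":
        return 0.0 if a.value == b.value else 1.0
    if a.value != b.value:
        return 3.0    # substituting one tag name for another
    # Same tag name: preserving it is free unless its attributes changed.
    return 0.0 if a.attrs == b.attrs else 0.5
```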

227. The method of claim 214, wherein document data is at least partly from the first 
document.

228. The method of claim 214, wherein document data is at least partly from the second
document.

229. The method of claim 214, wherein the second document is received if the second
document is different from the first document.






230. The method of claim 214, wherein the markup language includes at least HTML
(Hypertext Markup Language). 

231. The method of claim 214, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

232. The method of claim 214, wherein the markup language includes at least WML
(Wireless Markup Language).

233. The method of claim 214, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).

234. The method of claim 214, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.



235. The method of claim 214, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a second edit sequence between at least part of the first set of
data and at least part of a second tree representation of the second set of data, the first set
of data including at least part of the larger selected data, the second edit sequence
including any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the second edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the second edit sequence.
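
The widening step of claim 235 can be illustrated with a deliberately simplified sketch. It substitutes exact subtree equality for the claimed edit-sequence matching, which is an assumption made to keep the example short; the `Node` type, the path representation, and the function names are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Path = Tuple[int, ...]  # child indices leading from the root to a node


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def subtree_at(root: Node, path: Path) -> Node:
    node = root
    for index in path:
        node = node.children[index]
    return node


def find_equal(root: Node, target: Node, prefix: Path = ()) -> List[Path]:
    """Paths of every subtree of `root` that equals `target` exactly."""
    hits = [prefix] if root == target else []
    for i, child in enumerate(root.children):
        hits += find_equal(child, target, prefix + (i,))
    return hits


def locate(first: Node, second: Node, selected: Path) -> Optional[Path]:
    """Widen the selection toward the root until its match in `second` is unique."""
    path = selected
    while True:
        hits = find_equal(second, subtree_at(first, path))
        if len(hits) == 1:
            # Descend again along the part of the path that was trimmed off.
            return hits[0] + selected[len(path):]
        if not hits or not path:
            return None  # no match at all, or still ambiguous at the root
        path = path[:-1]  # select the enclosing, larger subtree
```

When a selected table cell appears in several rows, the cell alone is ambiguous, but its enclosing row is often unique; locating the row first and then stepping back down to the cell resolves the ambiguity.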

236. The method of claim 214, wherein one or more of the first set of data and the
second set of data is represented at least partly by a tree.

237. The method of claim 214, wherein one or more of the first set of data and the
second set of data is represented at least partly by a set of linearized tokens.
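
A set of linearized tokens, as in claim 237, can be produced from HTML with Python's standard `html.parser` module. The `Linearizer` class and the choice to drop attributes and whitespace-only text are assumptions of this sketch rather than requirements of the claims.

```python
from html.parser import HTMLParser
from typing import List


class Linearizer(HTMLParser):
    """Flatten markup into a list of start tags, text runs, and end tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")   # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:                          # skip whitespace-only runs
            self.tokens.append(text)


def linearize(markup: str) -> List[str]:
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens
```

For example, `linearize("<ul><li>Price: 5</li></ul>")` yields `["<ul>", "<li>", "Price: 5", "</li>", "</ul>"]`, a sequence that a token-level edit sequence can then operate on.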

238. The method of claim 214, wherein the first document and the second document 
represent different documents. 


239. The method of claim 214, wherein the first document and the second document 
represent a same document.

240. The method of claim 214, wherein the first document and the second document
represent different versions of a same document.

241. The method of claim 214, wherein at least one of the first edit sequence and the
second edit sequence includes a tree-based edit sequence.

242. The method of claim 214, wherein determining the edit sequence comprises:

determining at least one edit sequence of forward and backward edit sequences
between at least part of a first tree representation of the first set of data and at least part of
a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned edit sequence between the pruned relevant subtree
and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned edit sequence between at least part of the first tree
representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned edit sequence.

243. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes a first selected subset and a second
selected subset, such that the second selected subset of data is a subset of the first selected
subset of data, the first selected subset at least partly specifying document data, the second
selected subset at least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

finding a first corresponding subset of the second set of data, the first
corresponding subset having a correspondence to the first selected subset; and

finding a second corresponding subset of the second set of data, the second
corresponding subset having a correspondence to the second selected subset.
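
The nested selections of claim 243 can be illustrated by mapping token spans from one document to another. This sketch swaps in the matching blocks of Python's standard `difflib.SequenceMatcher` as a stand-in for whatever correspondence method is actually used; that substitution, the `Span` type, and the `map_span` helper are all assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of token indices


def map_span(first: List[str], second: List[str], span: Span) -> Optional[Span]:
    """Map a token span of `first` onto `second` through matching blocks."""
    matcher = SequenceMatcher(a=first, b=second, autojunk=False)
    mapped = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if span[0] <= block.a + k < span[1]:
                mapped.append(block.b + k)
    if not mapped:
        return None
    # Cover everything between the first and last matched token, so changed
    # content inside the span (for example a new price) is included.
    return (min(mapped), max(mapped) + 1)
```

Mapping a first selected subset such as a whole table row and a second selected subset such as one cell inside it gives two corresponding spans in the second document, with the cell's span falling inside the row's span.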



244. The method of claim 243, wherein document data is at least partly from the first
document. 

245. The method of claim 243, wherein document data is at least partly from the second 
document. 



246. The method of claim 243, wherein the second document is received if the second
document is different from the first document. 



247. The method of claim 243, wherein the markup language includes at least HTML
(Hypertext Markup Language).

248. The method of claim 243, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

249. The method of claim 243, wherein the markup language includes at least WML 
(Wireless Markup Language). 

250. The method of claim 243, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language). 

251. The method of claim 243, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.



252. The method of claim 243, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a first edit sequence between at least part of the first set of data
and at least part of a second tree representation of the second set of data, the first set of
data including at least part of the larger selected data, the first edit sequence including any
of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the first edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the first edit sequence.



253. The method of claim 243, wherein one or more of the first set of data and the
second set of data is represented at least partly by a tree.

254. The method of claim 243, wherein one or more of the first set of data and the
second set of data is represented at least partly by a set of linearized tokens.

255. The method of claim 243, wherein the first document and the second document
represent different documents. 

256. The method of claim 243, wherein the first document and the second document 
represent a same document.

257. The method of claim 243, wherein the first document and the second document 
represent different versions of a same document.

258. A method of extraction, comprising: 

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

determining document data of the second set of data, by finding corresponding data
of the second set of data, the corresponding data having a correspondence to the selected
data of the first set of data, the correspondence at least partly determined by a first
tree-based edit sequence between at least part of the first set of data and at least part of the
second set of data, the first tree-based edit sequence including any of insertions, deletions,
and substitutions;

identifying the corresponding data of the second set of data as selected data of the
second set of data, the selected data at least partly specifying document data;

accessing at least a third set of data of a third document, the third document
including markup language; and

determining document data of the third set of data, by finding corresponding data
of the third set of data, the corresponding data having a correspondence to at least one of
the selected data of the first set of data and the selected data of the second set of data, the
correspondence at least partly determined by a second tree-based edit sequence between at
least part of the third set of data and at least one of at least part of the first set of data and
at least part of the second set of data, the second tree-based edit sequence including any of
insertions, deletions, and substitutions.
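
The total cost of a tree-based edit sequence, of the kind claim 258 relies on, can be sketched with a simple top-down tree edit distance in which whole subtrees are inserted or deleted (a restriction often attributed to Selkow). The choice of this particular algorithm, the unit costs, and the `Node` type are assumptions; the claims do not name an algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the tree; used as the cost of inserting or deleting it."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance with unit costs per node."""
    root = 0 if a.label == b.label else 1   # match or substitute the roots
    n, m = len(a.children), len(b.children)
    # dp[i][j]: cost of turning the first i children of a into the first j of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])   # delete a subtree
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])   # insert a subtree
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + tree_distance(a.children[i - 1], b.children[j - 1]),
                dp[i - 1][j] + size(a.children[i - 1]),
                dp[i][j - 1] + size(b.children[j - 1]),
            )
    return root + dp[n][m]
```

This sketch returns only the total cost. Recovering the sequence of insertions, deletions, and substitutions itself would require recording which branch of each `min` was taken.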

259. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes none of insertions, deletions, and
substitutions.

260. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes at least one of one or more insertions, 
one or more deletions, and one or more substitutions. 

261. The method of claim 258, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence is at least partly determined by calculating a total 
cost, and each of one or more of insertions, deletions, substitutions, and matches is 
associated with one or more costs.

262. The method of claim 261, wherein the one or more costs are at least partly set to
encourage the tree-based edit sequence to include one or more matches between at least
some markup language from the selected data of the first document and at least some
markup language from the second document, the markup language including text-based
content and tags.

263. The method of claim 261, wherein a first cost is associated with a first match at a 
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match. 

264. The method of claim 261, wherein a first cost is associated with a first insertion at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different.

265. The method of claim 261, wherein a first cost is associated with a first deletion at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 

266. The method of claim 261, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different. 



267. The method of claim 261, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.

268. The method of claim 261, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

269. The method of claim 261, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

270. The method of claim 261, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag
with one or more changed attributes, and the first cost and the second cost are set to
discourage preserving the second tag more than preserving the first tag.


271. The method of claim 258, wherein subsequent sets of data of documents are
received, the documents including markup language, document data of the subsequent sets
of data are determined by finding corresponding data of the subsequent sets of data, the
corresponding data of the subsequent sets correspond to the selected data of earlier sets of
data, the corresponding data of the subsequent sets are identified as selected data of the
subsequent sets of data, the selected data of the subsequent sets of data at least partly
specifying document data, and at least one of selected data of the earlier sets and the
selected data of the subsequent data at least partly determine corresponding data of later
sets of data, the earlier sets of data are received earlier than the subsequent sets of data, and
the later sets of data are received later than the subsequent sets of data.
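
The chaining described in claims 258 and 271, where data found in one document becomes the selected data used for the next, can be sketched as a simple loop. The `track` function, the index-based selection, and the toy `locate_after_label` locator are hypothetical; a real locator would rely on an edit sequence rather than a label lookup, and this sketch only ever uses the most recent selection.

```python
from typing import Callable, List, Optional

Tokens = List[str]
Locator = Callable[[Tokens, Tokens, int], Optional[int]]


def locate_after_label(prev: Tokens, nxt: Tokens, index: int) -> Optional[int]:
    """Toy locator: find the token that follows the same label token."""
    if index < 1:
        return None
    label = prev[index - 1]
    if label not in nxt:
        return None
    found = nxt.index(label) + 1
    return found if found < len(nxt) else None


def track(documents: List[Tokens], selection: int, locate: Locator) -> List[int]:
    """Carry a selection through successive documents.

    Each correspondence that is found becomes the selected data used to
    search the next document; tracking stops at the first failure.
    """
    selections = [selection]
    for prev, nxt in zip(documents, documents[1:]):
        found = locate(prev, nxt, selections[-1])
        if found is None:
            break
        selections.append(found)
    return selections
```

Because each step starts from the latest identified selection, gradual changes across many versions of a page can be followed even when the last version differs greatly from the first.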



272. The method of claim 258, wherein document data is at least partly from the first
document.



273. The method of claim 258, wherein document data is at least partly from the second
document. 






274. The method of claim 258, wherein document data is at least partly from the third 
document. 

275. The method of claim 258, wherein the second document is received if the second
document is different from the first document. 



276. The method of claim 258, wherein the markup language includes at least HTML
(Hypertext Markup Language).

277. The method of claim 258, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

278. The method of claim 258, wherein the markup language includes at least WML 
(Wireless Markup Language).

279. The method of claim 258, wherein the markup language includes at least one of
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup
Language).


280. The method of claim 258, wherein the markup language includes at least text-based
content and tags, the tags detailing one or more of structure of content, semantics of
content, and formatting information about text-based content.

281. The method of claim 258, further comprising:

if two or more corresponding data are found, then:

selecting larger selected data, at least part of the larger selected data
including a larger subtree in a first tree representation of the first set of data, the larger
subtree including the selected data;

determining a third tree-based edit sequence between at least part of the
first set of data and at least part of a second tree representation of the second set of data,
the first set of data including at least part of the larger selected data, the third tree-based
edit sequence including any of insertions, deletions, and substitutions;

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the larger selected data, the correspondence at least partly found by
determining the third tree-based edit sequence; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the third tree-based edit sequence.

282. The method of claim 258, wherein one or more of the first set of data and the
second set of data is represented at least partly by a tree.

283. The method of claim 258, wherein one or more of the first set of data and the
second set of data is represented at least partly by a set of linearized tokens.

284. The method of claim 258, wherein at least two of the first document, the second
document, and the third document represent different documents. 



285. The method of claim 258, wherein at least two of the first document, the second
document, and the third document represent a same document. 

286. The method of claim 258, wherein at least two of the first document, the second
document, and the third document represent different versions of a same document. 

287. The method of claim 258, wherein at least one of the first tree-based edit sequence
and the second tree-based edit sequence includes a tree-based edit sequence.

288. The method of claim 258, wherein determining the tree-based edit sequence
comprises:

determining at least one tree-based edit sequence of forward and backward edit
sequences between at least part of a first tree representation of the first set of data and at
least part of a second tree representation of the second set of data;

performing at least one of 1) and 2):

1a) pruning a relevant subtree from at least part of the first tree representation,
the relevant subtree at least partly determined from the forward and backward edit
sequences;

1b) determining a pruned tree-based edit sequence between the pruned relevant
subtree and at least part of the second tree representation;

2a) pruning a relevant subtree from at least part of the second tree
representation, the relevant subtree at least partly determined from the forward and
backward edit sequences;

2b) determining a pruned tree-based edit sequence between at least part of the
first tree representation and the pruned relevant subtree; and

finding corresponding data of the second set of data, the corresponding data having
a correspondence to the selected data, the correspondence at least partly found by
determining the pruned tree-based edit sequence.

289. A method of extraction, comprising:

accessing at least a first set of data of a first document, the first document including
markup language, wherein the first set of data includes selected data, the selected data at
least partly specifying document data;

accessing at least a second set of data of a second document, the second document
including markup language;

finding one or more sets of corresponding data of the second set of data, each of
one or more sets of corresponding data having a strength of correspondence to the selected
data of the first set of data, the strength of correspondence at least partly determined by
some tree-based edit sequence between at least part of the second set of data and at least
part of the first set of data, the tree-based edit sequence including any of insertions,
deletions, and substitutions;

if two or more sets of corresponding data are found, then 1) if one of the
corresponding sets of data has a substantially higher strength of correspondence than
strengths of correspondence of the other corresponding sets of data, assigning a high
measure of quality to the selection of the selected data, and 2) if none of the corresponding
sets of data has a substantially higher strength of correspondence than strengths of
correspondence of the other corresponding sets of data, assigning a low measure of quality
to the selection of the selected data.

290. The method of claim 289, wherein the tree-based edit sequence includes none of
insertions, deletions, and substitutions.

291. The method of claim 289, wherein the tree-based edit sequence includes at least
one of one or more insertions, one or more deletions, and one or more substitutions.

292. The method of claim 289, wherein the tree-based edit sequence is at least partly
determined by calculating a total cost, and each of one or more of insertions, deletions,
substitutions, and matches is associated with one or more costs.






293. The method of claim 292, wherein the one or more costs are at least partly set to 
encourage the tree-based edit sequence to include one or more matches between at least 
some markup language from the selected data of the first document and at least some 
markup language from the second document, the markup language including text-based 
content and tags. 

294. The method of claim 292, wherein a first cost is associated with a first match at a
first distance from a root of a tree representation of some set of data, a second cost is
associated with a second match at a second distance from a root of a tree representation of
some set of data, the first distance is less than the second distance, and the first cost and
the second cost are set to encourage the first match more than the second match.






295. The method of claim 292, wherein a first cost is associated with a first insertion at
a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second insertion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and
the second cost are different. 



296. The method of claim 292, wherein a first cost is associated with a first deletion at a 
first distance from a root of a tree representation of some set of data, a second cost is 
associated with a second deletion at a second distance from a root of a tree representation
of some set of data, the first distance is less than the second distance, and the first cost and 
the second cost are different. 



297. The method of claim 292, wherein a first cost is associated with a first substitution
at a first distance from a root of a tree representation of some set of data, a second cost is
associated with a second substitution at a second distance from a root of a tree
representation of some set of data, the first distance is less than the second distance, and
the first cost and the second cost are different.

298. The method of claim 292, wherein a first cost is associated with a first text-based
content substitution such that a first length of substituting text-based content is
substantially equal to a first length of substituted text-based content, a second cost is
associated with a second text-based content substitution such that a second length of
substituting text-based content is substantially different from a second length of substituted
text-based content, and the first cost and the second cost are set to discourage the second
text-based content substitution more than the first text-based content substitution.

299. The method of claim 292, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of text-based content for one or more tags.

300. The method of claim 292, wherein markup language includes at least text-based
content and tags, and the one or more costs are at least partly set to discourage
substitutions of one or more tags for text-based content.

301. The method of claim 292, wherein a first cost is associated with preserving a first
tag with unchanged attributes, a second cost is associated with preserving a second tag
with one or more changed attributes, and the first cost and the second cost are set to
discourage preserving the second tag more than preserving the first tag. 

302. The method of claim 289, wherein document data is at least partly from the first
document.

303. The method of claim 289, wherein document data is at least partly from the second
document.


304. The method of claim 289, wherein the second document is received if the second
document is different from the first document.

305. The method of claim 289, wherein the markup language includes at least HTML
(Hypertext Markup Language).

306. The method of claim 289, wherein the markup language includes at least one of
XML, a subset of XML, and a specialization of XML (Extensible Markup Language).

307. The method of claim 289, wherein the markup language includes at least WML 
(Wireless Markup Language). 

308. The method of claim 289, wherein the markup language includes at least one of 
SGML, a subset of SGML, and a specialization of SGML (Standard Generalized Markup 
Language). 

309. The method of claim 289, wherein the markup language includes at least text-based 
content and tags, the tags detailing one or more of structure of content, semantics of 
content, and formatting information about text-based content. 

310. The method of claim 289, further comprising: 
if two or more corresponding data are found, then: 

selecting larger selected data, at least part of the larger selected data 
including a larger subtree in a first tree representation of the first set of data, the larger 
subtree including the selected data; 

determining a second tree-based edit sequence between at least part of the 
first set of data and at least part of a second tree representation of the second set of data, 
the first set of data including at least part of the larger selected data, the second tree-based 
edit sequence including any of insertions, deletions, and substitutions; 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the larger selected data, the correspondence at least partly found by 
determining the second tree-based edit sequence; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the second tree-based edit sequence. 

311. The method of claim 289, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a tree. 

312. The method of claim 289, wherein one or more of the first set of data and the 
second set of data is represented at least partly by a set of linearized tokens. 
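
As a non-limiting illustration of claim 312, the following Python sketch linearizes two markup documents into tokens, determines an edit sequence between the token lists, and uses that sequence to find the data in the second document corresponding to selected data in the first. It uses a plain sequence alignment over tokens, not a tree-based edit sequence, and the names `linearize`, `edit_sequence`, `corresponding_index`, and all cost values are assumptions made for this example only.

```python
from html.parser import HTMLParser

INDEL = 1  # assumed cost of one insertion or one deletion


class Linearizer(HTMLParser):
    """Flattens markup into a list of tag tokens and non-empty text tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip()))


def linearize(markup):
    parser = Linearizer()
    parser.feed(markup)
    parser.close()
    return parser.tokens


def cost(a, b):
    """Assumed costs: match 0, text-for-text 1, any other substitution 3."""
    if a == b:
        return 0
    return 1 if a[0] == "text" and b[0] == "text" else 3


def edit_sequence(first, second):
    """Returns a minimum-cost list of (operation, first_index, second_index)."""
    n, m = len(first), len(second)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * INDEL
    for j in range(1, m + 1):
        d[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + cost(first[i - 1], second[j - 1]),
                          d[i - 1][j] + INDEL,
                          d[i][j - 1] + INDEL)
    # Walk back from the end of both lists to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost(first[i - 1], second[j - 1]):
            op = "match" if first[i - 1] == second[j - 1] else "substitute"
            ops.append((op, i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + INDEL:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return ops


def corresponding_index(ops, selected_index):
    """Index in the second token list aligned with the selected first token."""
    for _, i, j in ops:
        if i == selected_index and j is not None:
            return j
    return None


first = linearize("<tr><td>Price</td><td>12.50</td></tr>")
second = linearize("<tr><td>New</td><td>Price</td><td>13.75</td></tr>")
ops = edit_sequence(first, second)
found = corresponding_index(ops, first.index(("text", "12.50")))
print(second[found])  # ('text', '13.75')
```

In this example the second document gains a new cell and the price changes, yet the selected price token is still aligned with the updated price, because inserting the new cell and substituting one text for another is the cheapest edit sequence under the assumed costs.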

313. The method of claim 289, wherein the first document and the second document 
represent different documents. 

314. The method of claim 289, wherein the first document and the second document 
represent a same document. 

315. The method of claim 289, wherein the first document and the second document 
represent different versions of a same document. 

316. The method of claim 289, wherein at least one of the first tree-based edit sequence 
and the second tree-based edit sequence includes a tree-based tree-based edit sequence. 

317. The method of claim 289, wherein determining the tree-based edit sequence 
comprises: 

determining at least one tree-based edit sequence of forward and backward edit 
sequences between at least part of a first tree representation of the first set of data and at 
least part of a second tree representation of the second set of data; 

performing at least one of 1) and 2): 

1a) pruning a relevant subtree from at least part of the first tree representation, 
the relevant subtree at least partly determined from the forward and backward edit 
sequences; 

1b) determining a pruned tree-based edit sequence between the pruned relevant 
subtree and at least part of the second tree representation; 

2a) pruning a relevant subtree from at least part of the second tree 
representation, the relevant subtree at least partly determined from the forward and 
backward edit sequences; 

2b) determining a pruned tree-based edit sequence between at least part of the 
first tree representation and the pruned relevant subtree; and 

finding corresponding data of the second set of data, the corresponding data having 
a correspondence to the selected data, the correspondence at least partly found by 
determining the pruned tree-based edit sequence. 